refactor: logical instead of physical extension array #6409

Draft
universalmind303 wants to merge 1 commit into main from extension-logical-refactor

Conversation

@universalmind303
Member

@universalmind303 universalmind303 commented Mar 16, 2026

Changes Made

Refactors ExtensionArray into a logical wrapper instead of a physical type. This closely aligns with how arrow-rs and pyarrow treat extensions (purely as metadata, not as a dedicated datatype).

Related Issues

Closes #6408

@universalmind303 universalmind303 requested a review from a team as a code owner March 16, 2026 22:38
@greptile-apps
Contributor

greptile-apps bot commented Mar 16, 2026

Greptile Summary

This PR refactors ExtensionArray from a physical DataArray<ExtensionType> (backed directly by an Arrow extension array) into a logical wrapper that stores a typed physical Series alongside extension metadata (name, metadata). This aligns Daft's extension type treatment with how arrow-rs and PyArrow handle extensions — as purely metadata with no dedicated physical representation.

Key changes:

  • New ExtensionArray struct in array/extension_array.rs wraps a physical: Series with extension name/metadata fields extracted from the DataType::Extension dtype.
  • A new ExtensionGrowable replaces the old ArrowGrowable<ExtensionType>, delegating to a physical growable and re-wrapping on build().
  • ExtensionType is now a thin tag struct implementing only DaftDataType, correctly removed from with_match_physical_daft_types! and with_match_arrow_daft_types!.
  • DataType::Extension is added to to_physical() (delegates to storage type) and is_nested().
  • Serialization format change: The old serializer stored the physical series under field name "physical"; the new serializer stores self.physical directly (using the column name). Old serialized data deserialized by new code will produce an ExtensionArray with a silent mismatch between field.name and physical.name(), potentially causing subtle bugs in downstream name-dependent operations.
  • agg_list/agg_set delegate to the physical series and return List<storage_type> instead of List<Extension(...)>, silently stripping extension metadata from grouped aggregations.
  • to_arrow().unwrap() introduced in get.rs replaces the previously infallible call, risking a runtime panic if the extension type's Arrow conversion fails.
  • The PR title is marked refactor: but the serialization format change is a breaking change for persisted data, which per project convention should be indicated as refactor!:.
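The wrapper design summarized in the list above can be illustrated with a minimal sketch. The `DataType`, `Series`, and `ExtensionArray` types below are simplified stand-ins written for this example, not Daft's actual definitions; the storage-type check in `new` is an assumption about what a stricter constructor might do, not something the PR is confirmed to implement.

```rust
use std::sync::Arc;

// Toy stand-ins for illustration only; Daft's real types differ.
#[derive(Clone, Debug, PartialEq)]
enum DataType {
    Binary,
    // Extension(name, storage type, optional metadata)
    Extension(String, Box<DataType>, Option<String>),
}

impl DataType {
    // An extension's physical type is the physical type of its storage.
    fn to_physical(&self) -> DataType {
        match self {
            DataType::Extension(_, storage, _) => storage.to_physical(),
            other => other.clone(),
        }
    }
}

#[derive(Clone, Debug, PartialEq)]
struct Series {
    name: String,
    dtype: DataType,
    values: Vec<Option<Vec<u8>>>,
}

// Logical wrapper: the extension identity sits beside the data, not inside it.
#[derive(Clone, Debug)]
struct ExtensionArray {
    name: String,
    dtype: DataType,
    extension_name: Arc<str>,
    metadata: Option<Arc<str>>,
    physical: Series,
}

impl ExtensionArray {
    // Hypothetical constructor: pulls name/metadata out of the dtype and
    // rejects a physical series whose dtype is not the storage type.
    fn new(name: &str, dtype: DataType, physical: Series) -> Result<Self, String> {
        let (extension_name, metadata) = match &dtype {
            DataType::Extension(ext_name, storage, metadata) => {
                if physical.dtype != storage.to_physical() {
                    return Err(format!(
                        "storage type {:?} does not match physical dtype {:?}",
                        storage, physical.dtype
                    ));
                }
                (
                    Arc::<str>::from(ext_name.as_str()),
                    metadata.as_deref().map(|m| Arc::<str>::from(m)),
                )
            }
            other => return Err(format!("expected an extension dtype, got {:?}", other)),
        };
        Ok(Self { name: name.to_string(), dtype, extension_name, metadata, physical })
    }

    // Data-level questions (length, validity, ...) are answered by the physical series.
    fn len(&self) -> usize {
        self.physical.values.len()
    }
}
```

In this sketch every data-level operation goes through `physical`, while `extension_name` and `metadata` only travel alongside it, which is the sense in which the extension becomes "purely metadata".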

Confidence Score: 3/5

  • The core refactor design is sound but introduces a potential panic in get.rs, silent extension-type loss in aggregations, and a serialization backward-compatibility issue.
  • The architectural direction is correct and well-aligned with the stated goal. However, three issues lower confidence: (1) to_arrow().unwrap() in get.rs can panic where the old code was infallible; (2) agg_list/agg_set strip extension metadata from results; (3) the serialization format change (physical series field name) breaks reading old serialized data.
  • Files to review most closely: src/daft-core/src/series/array_impl/extension_array.rs (aggregation type loss), src/daft-core/src/array/ops/get.rs (unwrap panic), src/daft-core/src/series/serdes.rs (backward compat)

Important Files Changed

Filename | Overview
src/daft-core/src/array/extension_array.rs | New logical ExtensionArray struct wrapping a physical Series. Core design is sound; to_arrow correctly casts using field metadata. Minor issue: no validation that physical.name() == field.name in constructor.
src/daft-core/src/series/array_impl/extension_array.rs | New SeriesLike impl for ExtensionArray. agg_list, agg_set, min, and max all return physical-typed series, stripping extension metadata. if_else has a silent fallback for non-extension inputs.
src/daft-core/src/array/ops/get.rs | Changed is_valid to delegate to self.physical.is_valid. Added .unwrap() on the now-fallible to_arrow(), introducing a potential runtime panic.
src/daft-core/src/series/serdes.rs | Simplified deserialization directly constructs ExtensionArray::new, but creates a field-name/physical-name mismatch when reading data serialized by the old format (which used the hardcoded name "physical").
src/daft-core/src/array/serdes.rs | Serialization simplified to use self.physical directly instead of going through Arrow conversion; changes the serialized field name from "physical" to the column name, which may break reading old data.
src/daft-core/src/array/growable/extension_growable.rs | New ExtensionGrowable correctly delegates to a physical growable and re-wraps the result in ExtensionArray on build(). Looks correct.
src/daft-core/src/datatypes/mod.rs | Replaces DataArray<ExtensionType> with a custom ExtensionType struct implementing only DaftDataType, correctly removing it from physical/Arrow-backed type families. get_dtype() returning DataType::Unknown is a reasonable sentinel.
src/daft-schema/src/dtype.rs | Added Extension to to_physical() (delegates to the storage type) and to is_nested(). Both additions are correct and complete the type system integration.
src/daft-core/src/datatypes/matching.rs | Correctly removes Extension from with_match_physical_daft_types! and with_match_arrow_daft_types! since it is now a logical type; with_match_daft_types! (unchanged) still handles it via ExtensionType.
src/daft-core/src/array/ops/null.rs | New is_valid for ExtensionArray follows the same pattern as FixedSizeListArray/ListArray/StructArray. Correct.
src/daft-ext-core/src/function.rs | Trivial fix: removed unused mut on args parameter, fixing a clippy warning per project convention.

Class Diagram

%%{init: {'theme': 'neutral'}}%%
classDiagram
    class ExtensionArray {
        +Arc~Field~ field
        +Arc~str~ extension_name
        +Option~Arc~str~~ metadata
        +Series physical
        +new(field, physical) Self
        +to_arrow() DaftResult~ArrayRef~
        +slice(start, end) DaftResult~Self~
        +concat(arrays) DaftResult~Self~
        +with_physical(physical) Self
        +rename(name) Self
    }

    class Series {
        +Arc~dyn SeriesLike~ inner
    }

    class ExtensionGrowable {
        +String name
        +DataType dtype
        +Box~dyn Growable~ physical_growable
        +extend(index, start, len)
        +add_nulls(additional)
        +build() DaftResult~Series~
    }

    class ExtensionType {
        +get_dtype() DataType
        ArrayType: ExtensionArray
    }

    class DaftDataType {
        <<trait>>
        +get_dtype() DataType
        +ArrayType
    }

    class SeriesLike {
        <<trait>>
        +to_arrow() DaftResult~ArrayRef~
        +cast(datatype) DaftResult~Series~
        +filter(mask) DaftResult~Series~
        +agg_list(groups) DaftResult~Series~
    }

    DaftDataType <|-- ExtensionType
    SeriesLike <|.. ArrayWrapper~ExtensionArray~
    ExtensionArray --> Series : wraps physical
    ExtensionGrowable --> ExtensionArray : builds
    ExtensionType --> ExtensionArray : ArrayType

Last reviewed commit: 71bf5b1

);
let is_valid = self.is_valid(idx);
if is_valid {
if self.physical.is_valid(idx) {

Potential panic on to_arrow() failure

to_arrow() on ExtensionArray is now fallible — it performs a field lookup (which can fail if the extension type has no Arrow representation) and potentially a cast. The previous implementation was infallible (it returned the inner Arrow array directly). Using unwrap() here will panic at runtime if the cast fails, rather than returning None gracefully.

Consider propagating the error or returning None:

Suggested change
if self.physical.is_valid(idx) {
let scalar = Scalar::new(scalar.to_arrow().ok()?);
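The two options mentioned above (returning None, or propagating the error) can be contrasted in a small self-contained sketch. `ExtScalar`, `ArrowScalar`, and `ExtColumn` are hypothetical toy types invented for this example and do not correspond to Daft's `get` implementation; the `convertible` flag simply simulates a conversion that may fail.

```rust
// Toy stand-in for a converted Arrow value.
#[derive(Debug, Clone, PartialEq)]
struct ArrowScalar(Vec<u8>);

// Toy extension scalar whose Arrow conversion can fail.
struct ExtScalar {
    bytes: Vec<u8>,
    convertible: bool, // simulates "has an Arrow representation"
}

impl ExtScalar {
    fn to_arrow(&self) -> Result<ArrowScalar, String> {
        if self.convertible {
            Ok(ArrowScalar(self.bytes.clone()))
        } else {
            Err("no Arrow representation".to_string())
        }
    }
}

struct ExtColumn {
    values: Vec<Option<ExtScalar>>,
}

impl ExtColumn {
    // Option-returning getter: out-of-bounds, null, and failed conversion all
    // become None, so nothing panics (but the failure reason is lost).
    fn get(&self, idx: usize) -> Option<ArrowScalar> {
        let scalar = self.values.get(idx)?.as_ref()?;
        scalar.to_arrow().ok()
    }

    // Result-returning getter: nulls are Ok(None), conversion failures are Err.
    fn try_get(&self, idx: usize) -> Result<Option<ArrowScalar>, String> {
        match self.values.get(idx).and_then(|v| v.as_ref()) {
            Some(scalar) => scalar.to_arrow().map(Some),
            None => Ok(None),
        }
    }
}
```

The `get` variant keeps an Option-shaped signature at the cost of making a conversion failure indistinguishable from a null; `try_get` keeps the two cases apart but changes the signature for callers.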

Comment on lines +55 to +62
fn agg_list(&self, groups: Option<&GroupIndices>) -> DaftResult<Series> {
self.0.physical.agg_list(groups)
}

fn agg_set(&self, groups: Option<&GroupIndices>) -> DaftResult<Series> {
self.0.physical.agg_set(groups)
}

agg_list and agg_set discard extension type info

Both agg_list and agg_set delegate directly to self.0.physical, which returns a List<storage_type> (e.g., List<Binary>) instead of List<Extension(...)>. This silently strips the extension metadata from the result, causing type mismatches for any downstream code that expects List<Extension> from aggregating an ExtensionArray.

min and max have the same issue — they return the physical type rather than an ExtensionArray. For completeness, agg_list/agg_set at minimum should wrap the result back with the extension metadata since they produce a list aggregation of the same type.
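A rough sketch of the re-wrapping idea follows, using toy types that track only dtypes and lengths. `physical_agg_list` and the simplified `ExtensionArray` here are assumptions for illustration; in the real codebase, re-tagging the list's child type would likely involve rebuilding the list array around an extension-typed child rather than assigning a dtype.

```rust
// Toy dtype model; only what is needed to show the type difference.
#[derive(Clone, Debug, PartialEq)]
enum DataType {
    Binary,
    List(Box<DataType>),
    // Extension(name, storage type)
    Extension(String, Box<DataType>),
}

// Toy series that carries just a dtype and a length.
#[derive(Clone, Debug, PartialEq)]
struct Series {
    dtype: DataType,
    len: usize,
}

// Stand-in for the physical aggregation: it knows nothing about extensions,
// so its result is List<storage_type>.
fn physical_agg_list(physical: &Series, num_groups: usize) -> Series {
    Series {
        dtype: DataType::List(Box::new(physical.dtype.clone())),
        len: num_groups,
    }
}

struct ExtensionArray {
    dtype: DataType, // the full Extension(...) dtype
    physical: Series,
}

impl ExtensionArray {
    // Delegate the work to the physical series, then restore the extension
    // dtype on the list's child so callers see List<Extension(...)>.
    fn agg_list(&self, num_groups: usize) -> Series {
        let mut out = physical_agg_list(&self.physical, num_groups);
        out.dtype = DataType::List(Box::new(self.dtype.clone()));
        out
    }
}
```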

Comment on lines 154 to +156
DataType::Extension(..) => {
let physical = map.next_value::<Series>()?;
let physical = physical.to_arrow().unwrap();

Ok(ExtensionArray::from_arrow(Arc::new(field), physical)
.unwrap()
.into_series())
Ok(ExtensionArray::new(Arc::new(field), physical).into_series())

Serialization format change breaks backward compatibility with old data

The old serialize for ExtensionArray stored the physical series with the hardcoded field name "physical":

Field::new("physical", inner.as_ref().clone())

The new serialize stores self.physical directly, which has the actual column name as its field name.

When deserializing old-format data with this new code, map.next_value::<Series>() would produce a series named "physical", while field.name is the actual column name (e.g., "my_col"). The resulting ExtensionArray is then created with a field named "my_col" and a physical series named "physical" — a silent mismatch.

This mismatch propagates to any code that calls self.physical.name() expecting the column name (e.g., display, schema inspection). Consider validating that physical.name() == field.name in ExtensionArray::new, or documenting this constraint.
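Two ways of handling the name constraint can be sketched as follows: a strict constructor that rejects a mismatch, and a lenient one that renames the physical series so that old-format payloads still load. The types and the method names `try_new` and `new_renaming` are hypothetical and exist only in this example.

```rust
// Toy series: a name plus some bytes.
#[derive(Clone, Debug, PartialEq)]
struct Series {
    name: String,
    values: Vec<u8>,
}

impl Series {
    fn rename(&self, name: &str) -> Series {
        Series { name: name.to_string(), values: self.values.clone() }
    }
}

#[derive(Debug)]
struct ExtensionArray {
    field_name: String,
    physical: Series,
}

impl ExtensionArray {
    // Strict variant: surface the mismatch instead of carrying it silently.
    fn try_new(field_name: &str, physical: Series) -> Result<Self, String> {
        if physical.name != field_name {
            return Err(format!(
                "physical series is named {:?} but the field is named {:?}",
                physical.name, field_name
            ));
        }
        Ok(Self { field_name: field_name.to_string(), physical })
    }

    // Lenient variant: align the physical name with the field name, which
    // would let data written with the hardcoded "physical" name load cleanly.
    fn new_renaming(field_name: &str, physical: Series) -> Self {
        let physical = if physical.name == field_name {
            physical
        } else {
            physical.rename(field_name)
        };
        Self { field_name: field_name.to_string(), physical }
    }
}
```

The lenient form is probably the better fit for a deserializer that must read old data, while the strict form suits internal call sites where a mismatch indicates a bug.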

Comment on lines +88 to +97
fn if_else(&self, other: &Series, predicate: &Series) -> DaftResult<Series> {
let other_physical = match other.downcast::<ExtensionArray>() {
Ok(other_ext) => &other_ext.physical,
Err(_) => other,
};
Ok(self
.0
.with_physical(self.0.physical.if_else(other_physical, predicate)?)
.into_series())
}

if_else with a non-extension other may silently produce wrong types

When other is not an ExtensionArray, the code falls back to using other directly:

Err(_) => other,

This passes the raw other series (of an arbitrary type) into self.0.physical.if_else(other_physical, predicate). While the inner if_else may fail with a type mismatch error, this isn't immediately obvious from the call site. It would be clearer to return an explicit DaftError::TypeError when other is not an ExtensionArray, rather than falling through to a potentially confusing error from the physical layer.
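The explicit-error variant might look like the sketch below. The `Series` enum, the local `DaftError` enum, and the free function `if_else` are simplified stand-ins defined for this example (the real `DaftError` and `SeriesLike::if_else` have different shapes); the point is only the early return when `other` is not an extension.

```rust
// Local stand-in for an error type with a TypeError variant.
#[derive(Debug, PartialEq)]
enum DaftError {
    TypeError(String),
}

// Toy series: either a plain column or an extension wrapping physical values.
#[derive(Clone, Debug, PartialEq)]
enum Series {
    Plain(Vec<i64>),
    Extension { name: String, physical: Vec<i64> },
}

// Element-wise select between `this_physical` and `other`, keeping the
// extension name on the result. A non-extension `other` is rejected up front
// rather than being handed to the physical layer.
fn if_else(
    name: &str,
    this_physical: &[i64],
    other: &Series,
    predicate: &[bool],
) -> Result<Series, DaftError> {
    let other_physical = match other {
        Series::Extension { physical, .. } => physical,
        Series::Plain(_) => {
            return Err(DaftError::TypeError(format!(
                "if_else on extension {:?} requires an extension `other`",
                name
            )))
        }
    };
    // Note: zip stops at the shortest input; length checks are omitted here.
    let physical = predicate
        .iter()
        .zip(this_physical.iter().zip(other_physical.iter()))
        .map(|(&p, (&a, &b))| if p { a } else { b })
        .collect();
    Ok(Series::Extension { name: name.to_string(), physical })
}
```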

@universalmind303 universalmind303 marked this pull request as draft March 16, 2026 22:54
@codecov

codecov bot commented Mar 16, 2026

Codecov Report

❌ Patch coverage is 47.03704% with 143 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.68%. Comparing base (6bec12d) to head (71bf5b1).

Files with missing lines | Patch % | Lines
...daft-core/src/series/array_impl/extension_array.rs | 51.30% | 56 Missing ⚠️
src/daft-core/src/array/extension_array.rs | 62.74% | 38 Missing ⚠️
...daft-core/src/array/growable/extension_growable.rs | 0.00% | 29 Missing ⚠️
src/daft-core/src/array/growable/mod.rs | 0.00% | 9 Missing ⚠️
src/daft-core/src/array/ops/null.rs | 0.00% | 5 Missing ⚠️
src/daft-core/src/datatypes/mod.rs | 0.00% | 3 Missing ⚠️
src/daft-core/src/array/ops/get.rs | 0.00% | 2 Missing ⚠️
src/daft-schema/src/dtype.rs | 0.00% | 1 Missing ⚠️
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #6409      +/-   ##
==========================================
- Coverage   74.78%   74.68%   -0.10%     
==========================================
  Files        1020     1023       +3     
  Lines      136319   136557     +238     
==========================================
+ Hits       101949   101994      +45     
- Misses      34370    34563     +193     
Files with missing lines | Coverage Δ
src/daft-core/src/array/mod.rs | 90.07% <ø> (-0.40%) ⬇️
src/daft-core/src/array/ops/broadcast.rs | 55.39% <ø> (ø)
src/daft-core/src/array/ops/get_lit.rs | 84.28% <100.00%> (ø)
src/daft-core/src/array/serdes.rs | 70.40% <100.00%> (-0.60%) ⬇️
src/daft-core/src/series/array_impl/data_array.rs | 97.11% <ø> (+0.71%) ⬆️
src/daft-core/src/series/serdes.rs | 61.08% <100.00%> (-0.53%) ⬇️
src/daft-ext-core/src/function.rs | 85.71% <100.00%> (ø)
src/daft-schema/src/dtype.rs | 86.07% <0.00%> (-0.65%) ⬇️
src/daft-core/src/array/ops/get.rs | 84.34% <0.00%> (+0.36%) ⬆️
src/daft-core/src/datatypes/mod.rs | 10.34% <0.00%> (-1.20%) ⬇️
... and 5 more

... and 15 files with indirect coverage changes

Development

Successfully merging this pull request may close these issues.

refactor: ExtensionArray from DataArray to logical array backed by Series

1 participant